We give a brief introduction to the AIXI model, which unifies and
overcomes the limitations of sequential decision theory and universal
Solomonoff induction. While the former theory is suited for active
agents in known environments, the latter is suited for passive
prediction of unknown environments.



Introduction: Every inductive inference problem can be brought into
the following form: Given a string
$x_1 x_2 \ldots x_{t-1} = x_{1:t-1} = x_{<t}$, take a guess at its
continuation $x_t$. We will assume that the strings which have to be
continued are drawn from a probability distribution $\mu$. The maximal
prior information a prediction algorithm can possess is the exact
knowledge of $\mu$, but often the true distribution is unknown.
Instead, prediction is based on a guess $\rho$ of $\mu$. We expect that
a predictor based on $\rho$ performs well if $\rho$ is close to $\mu$
or converges to $\mu$.

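Universal probability distribution: Let
$\mathcal{M} := \{\mu_1, \mu_2, \ldots\}$ be a finite or countable set
of candidate probability distributions on strings. We define a
weighted average on $\mathcal{M}$,

$$\xi(x_{1:n}) := \sum_{\mu_i \in \mathcal{M}} w_{\mu_i} \cdot \mu_i(x_{1:n}), \qquad \sum_{\mu_i \in \mathcal{M}} w_{\mu_i} = 1, \qquad w_{\mu_i} > 0.$$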

We call $\xi$ universal relative to $\mathcal{M}$, as it
multiplicatively dominates all distributions in $\mathcal{M}$, i.e.
$\xi(x_{1:n}) \geq w_{\mu_i} \cdot \mu_i(x_{1:n})$ for all
$\mu_i \in \mathcal{M}$. In the following, we assume that $\mathcal{M}$
is known and contains the true distribution from which
$x_1 x_2 \ldots$ is sampled, i.e. $\mu \in \mathcal{M}$. The condition
$\mu \in \mathcal{M}$ is not a serious constraint if we include all
computable probability distributions in $\mathcal{M}$ with high weights
assigned to simple $\mu_i$. Solomonoff-Levin's universal semi-measure
is obtained if we include all enumerable semi-measures in $\mathcal{M}$
with weights $w_{\mu_i} \sim 2^{-K(\mu_i)}$, where $K(\mu_i)$ is the
length of the shortest program for $\mu_i$ […]. One can show that the
conditional $\xi$ and $\mu$ probabilities rapidly converge to each
other:

$$\xi(x_t|x_{<t}) \to \mu(x_t|x_{<t}) \quad \text{with $\mu$ probability 1.} \qquad (1)$$

Since the conditional probabilities are the basis of the decision
algorithms considered in this work, we expect a good prediction
performance if we use $\xi$ as a guess of $\mu$.
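The mixture and its dominance property can be illustrated with a small sketch. It is not part of the original text: the class of five Bernoulli sources, the uniform weights, and the 200-bit sequence (a deterministic stand-in for a typical sample from $\mu$ = Bernoulli(0.7)) are all assumptions chosen for illustration.

```python
def bernoulli_prob(theta, bits):
    """mu_theta(x_{1:n}) for an i.i.d. Bernoulli(theta) source."""
    ones = sum(bits)
    return theta ** ones * (1.0 - theta) ** (len(bits) - ones)


def xi(bits, thetas, weights):
    """Mixture xi(x_{1:n}) = sum_i w_i * mu_i(x_{1:n})."""
    return sum(w * bernoulli_prob(th, bits) for th, w in zip(thetas, weights))


def xi_conditional(next_bit, history, thetas, weights):
    """xi(x_t | x_{<t}) = xi(x_{<t} x_t) / xi(x_{<t})."""
    return xi(history + [next_bit], thetas, weights) / xi(history, thetas, weights)


# Hypothetical class M with uniform prior weights (illustrative values only).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [1.0 / len(thetas)] * len(thetas)

# Deterministic stand-in for a typical sample from mu = Bernoulli(0.7):
# 200 bits, exactly 70% of them ones.
sequence = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0] * 20

# Before any data, xi predicts with the prior average of the candidates.
prior_guess = xi_conditional(1, [], thetas, weights)

# After 200 bits, the conditional xi probability is close to mu's 0.7.
late_guess = xi_conditional(1, sequence, thetas, weights)

# Multiplicative dominance: xi(x) >= w_i * mu_i(x) for every candidate.
dominates = all(
    xi(sequence, thetas, weights) >= w * bernoulli_prob(th, sequence)
    for th, w in zip(thetas, weights)
)
```

Dominance holds trivially here because $\xi$ is a sum of non-negative terms. The convergence in (1) shows up as the posterior weight concentrating on the candidate that matches the data.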



Bayesian decisions: Let $\ell_{x_t y_t} \in [0,1]$ be the received loss
when predicting $y_t \in \mathcal{Y}$, but $x_t \in \mathcal{X}$ turns
out to be the true $t$-th symbol of the sequence. Let
$L_{n\Lambda_\rho}$ be the total expected loss for the first $n$
symbols of the Bayes predictor $\Lambda_\rho$ which minimizes the
$\rho$ expected loss. For instance for
$\mathcal{X} = \mathcal{Y} = \{0,1\}$, $\Lambda_\rho$ is a threshold
strategy with $y_t^{\Lambda_\rho} = 0/1$ for
$\rho(1|x_{<t}) \lessgtr \gamma$, where
$\gamma := \frac{\ell_{01} - \ell_{00}}{\ell_{01} - \ell_{00} + \ell_{10} - \ell_{11}}$.
Let $\Lambda$ be any prediction scheme (deterministic or probabilistic)
with no constraint at all, taking any action $y_t^\Lambda$ with total
expected loss $L_{n\Lambda}$. If $\mu$ is known, $\Lambda_\mu$ is
obviously the best prediction scheme in the sense of achieving minimal
expected loss $L_{n\Lambda_\mu} \leq L_{n\Lambda}$ for any $\Lambda$.
For the predictor $\Lambda_\xi$ based on the universal distribution
$\xi$, one can show
$L_{n\Lambda_\xi}/L_{n\Lambda_\mu} = 1 + O\big(\sqrt{K(\mu)/L_{n\Lambda_\mu}}\big)$,
i.e. $\Lambda_\xi$ has optimal asymptotics for
$L_{n\Lambda_\mu} \to \infty$ with rapid convergence of the quotient
to 1. If $L_{\infty\Lambda_\mu}$ is finite, then also
$L_{\infty\Lambda_\xi}$ […].
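The binary threshold rule can be checked against direct minimization of the expected loss. The sketch below is illustrative only: the function names and the asymmetric loss matrix are hypothetical, and it assumes that a wrong prediction costs more than a correct one, so that the denominator of $\gamma$ is positive.

```python
def bayes_action(p1, loss):
    """Minimize expected loss directly.

    p1         : rho(1 | x_{<t}), the predicted probability of symbol 1.
    loss[x][y] : loss when x is the true symbol and y was predicted.
    """
    expected = [(1.0 - p1) * loss[0][y] + p1 * loss[1][y] for y in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1


def threshold(loss):
    """gamma = (l01 - l00) / (l01 - l00 + l10 - l11)."""
    numerator = loss[0][1] - loss[0][0]
    return numerator / (numerator + loss[1][0] - loss[1][1])


def threshold_action(p1, loss):
    """Predict 1 exactly when rho(1 | x_{<t}) exceeds gamma."""
    return 1 if p1 > threshold(loss) else 0


# Hypothetical asymmetric loss: missing a 1 costs 1.0, a false alarm 0.2.
asymmetric = [[0.0, 0.2],
              [1.0, 0.0]]
zero_one = [[0.0, 1.0],
            [1.0, 0.0]]

# The two rules agree on a grid of probabilities (no grid point ties).
grid = [i / 100.0 for i in range(101)]
agree = all(bayes_action(p, asymmetric) == threshold_action(p, asymmetric)
            for p in grid)
```

With the asymmetric loss the threshold drops to 1/6, so the predictor says 1 even when 1 is fairly unlikely, because missing it is expensive.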

More active systems: Prediction means guessing the future, but not
influencing it. One step in the direction to more active systems was
to allow the $\Lambda$ system to act and to receive a loss
$\ell_{x_t y_t}$ depending on the action $y_t$ and the outcome $x_t$.
The probability $\mu$ is still independent of the action, and the loss
function $\ell^t$ has to be known in advance. This ensures that the
greedy $\Lambda_\mu$ strategy is still optimal. The loss function can
also be generalized to depend on the history $x_{<t}$ and on $t$.

Agents in known probabilistic environments: The full model of an
acting agent influencing the environment has been developed in […].
The probability of the next symbol (input, perception) $x_t$ depends in
this case not only on the past sequence $x_{<t}$ but also on the past
actions (outputs) $y_{1:t}$, i.e. $\mu = \mu(x_t|x_{<t}y_{1:t})$. We
call probability distributions of this form chronological. The total
$\mu$ expected loss is
$\sum_{x_{1:n}} (\ell^1 + \ldots + \ell^n)\,\mu(x_{1:n}|y_{1:n})$,
where we assumed a total number of $n$ interaction cycles. Action
$y_t(x_{<t}y_{<t})$ and loss function $\ell^t(x_{1:t}y_{1:t})$ may
depend on the complete history, which allows planning and delayed loss
assignment.

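Sequential decision theory: The goal is to perform the actions which
minimize the total $\mu$ expected loss:

$$y_t := \arg\min_{y_t}\sum_{x_t}\ldots\min_{y_n}\sum_{x_n} (\ell^1 + \ldots + \ell^n)\,\mu(x_{1:n}|y_{1:n}), \qquad (2)$$

$$L_{n\Lambda_\mu} = \min_{y_1}\sum_{x_1}\ldots\min_{y_n}\sum_{x_n} (\ell^1 + \ldots + \ell^n)\,\mu(x_{1:n}|y_{1:n}). \qquad (3)$$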
The minimization over $y_t$ is in chronological order to correctly
incorporate the dependency of $x_t$ and $y_t$ on the history. Note
that $y_t$ only depends on the known history $x_{<t}y_{<t}$, whereas
minima and expectations are taken over the unknown $x_{t:n}y_{t:n}$
variables. The policy (2) (called AI$\mu$ model) is optimal in the
sense that no other policy leads to lower $\mu$-expected loss.
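For small finite $\mathcal{X}$, $\mathcal{Y}$ and $n$, the alternating minimization and expectation can be evaluated by plain recursion over histories. The sketch below is one possible reading, not the paper's implementation. It uses the factorization of $\mu(x_{1:n}|y_{1:n})$ into chronological conditionals and minimizes the expected loss still to come, which selects the same action because losses already received are constants. The toy environment and loss are hypothetical: playing action 1 in the first cycle costs extra immediately but makes later percepts favourable.

```python
def expectimin(mu, loss, X, Y, n, xs=(), ys=()):
    """Return (minimal expected future loss, minimizing action y_t).

    mu(x, xs, ys)  : mu(x_t | x_{<t} y_{1:t}); ys includes the current action.
    loss(xs, ys)   : l^t(x_{1:t} y_{1:t}) for the cycle just completed.
    xs, ys         : known history x_{<t}, y_{<t} as tuples.
    """
    if len(xs) == n:
        return 0.0, None
    best_value, best_action = float("inf"), None
    for y in Y:
        value = 0.0
        for x in X:
            p = mu(x, xs, ys + (y,))
            if p == 0.0:
                continue  # impossible percepts contribute nothing
            future, _ = expectimin(mu, loss, X, Y, n, xs + (x,), ys + (y,))
            value += p * (loss(xs + (x,), ys + (y,)) + future)
        if value < best_value:
            best_value, best_action = value, y
    return best_value, best_action


def toy_mu(x, xs, ys):
    """Hypothetical environment: x_t = 1 is likely only if y_1 was 1."""
    p_one = 0.9 if ys[0] == 1 else 0.2
    return p_one if x == 1 else 1.0 - p_one


def toy_loss(xs, ys):
    """Hypothetical loss in [0, 1]: percept 0 costs 0.5; y_1 = 1 costs 0.5 once."""
    first_cycle_cost = 0.5 if len(xs) == 1 and ys[0] == 1 else 0.0
    return 0.5 * (1 - xs[-1]) + first_cycle_cost


X, Y = (0, 1), (0, 1)
greedy_value, greedy_action = expectimin(toy_mu, toy_loss, X, Y, 1)
planned_value, planned_action = expectimin(toy_mu, toy_loss, X, Y, 3)
```

With horizon 1 the minimizer avoids the costly first action, while with horizon 3 it accepts the early loss because the later cycles repay it. The recursion visits on the order of $(|\mathcal{X}||\mathcal{Y}|)^n$ histories, so it is only practical for tiny examples.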

Bellman equations: In the case that $\ell^t$ is independent of
$y_{<t}$ and $\mu$ is independent of $y_{1:n}$, policy (2) reduces to
the greedy Bayes $\Lambda_\mu$ strategy. For (completely observable)
Markov Decision Processes $\mu = \mu(x_t|x_{t-1}y_t)$, (2) and (3) can
be written as recursive Bellman equations of sequential decision
theory with state space $\mathcal{X}$, action space $\mathcal{Y}$,
state transition matrix $\mu(x_t|x_{t-1}y_t)$, rewards $-\ell^t$, etc.
The general (non-MDP) case may also be (artificially) reduced to
Bellman equations by identifying complete histories $x_{<t}y_{<t}$ with
states and $\mu(x_t|x_{<t}y_{1:t})$ with the state transition matrix.
Due to the use of complete histories as state space, the AI$\mu$ model
neither assumes stationarity, nor the Markov property, nor complete
accessibility of the environment. But since every state occurs at most
once in the lifetime of the system, the explicit formulation (2) is
more useful than a pseudo-recursive Bellman equation form. There is no
principal problem in determining $y_t$ as long as $\mu$ is known and
computable and $\mathcal{X}$, $\mathcal{Y}$ and $n$ are finite.
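In the MDP special case the recursion only needs one value per state and remaining horizon, rather than one per history. A minimal finite-horizon sketch follows. The two-state transition table and the loss are hypothetical, and the loss is assumed to depend only on the state reached and the action taken.

```python
def backward_induction(P, loss, states, actions, n):
    """Finite-horizon Bellman recursion for an MDP.

    P[(x, y)][x2] : mu(x_t = x2 | x_{t-1} = x, y_t = y).
    loss(x2, y)   : loss received when action y leads to state x2.
    Returns (V, policy): V[k][x] is the minimal expected loss with k cycles
    remaining in state x, and policy[k][x] is a minimizing action.
    """
    V = [{x: 0.0 for x in states}]   # no cycles left, no further loss
    policy = [{}]
    for k in range(1, n + 1):
        Vk, pk = {}, {}
        for x in states:
            # Expected loss of each action: immediate loss plus loss-to-go.
            q = {y: sum(P[(x, y)][x2] * (loss(x2, y) + V[k - 1][x2])
                        for x2 in states)
                 for y in actions}
            pk[x] = min(q, key=q.get)
            Vk[x] = q[pk[x]]
        V.append(Vk)
        policy.append(pk)
    return V, policy


# Hypothetical MDP: state 1 is cheap to stay in, action 1 costs effort
# but makes reaching state 1 more likely.
states, actions = (0, 1), (0, 1)
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.4, 1: 0.6},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.1, 1: 0.9},
}


def mdp_loss(x2, y):
    """Loss in [0, 1]: being in state 0 costs 0.5, action 1 costs 0.3."""
    return 0.5 * (1 - x2) + 0.3 * y


V, policy = backward_induction(P, mdp_loss, states, actions, 2)
```

The table has $n \cdot |\mathcal{X}|$ entries. In this toy case the greedy one-step action in state 0 is 0, but with two cycles remaining the recursion prefers action 1.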

Reinforcement learning for unknown environment: Things dramatically
change if $\mu$ is unknown. Reinforcement learning algorithms are
commonly used in this case to learn the unknown $\mu$ (or directly a
value function). They succeed if the state space is either small or
has effectively been made small by generalization or function
approximation techniques. In almost all approaches, the solutions are
either ad hoc, or work in restricted domains only, or have serious
problems with state space exploration versus exploitation, or have
non-optimal learning rates. Below we propose the AI$\xi$ model as a
universal and optimal solution to these problems.

Unknown loss function: Furthermore, the loss function
$\ell^t(x_{1:t}y_{1:t})$ may also be unknown, but there is an easy
"solution" to this problem. The specification of the loss function can
be absorbed in the probability distribution $\mu$ by increasing the
input space $\mathcal{X}$. Let $x_t = x'_t l_t$, where $x'_t$ is the
regular input, $l_t$ is interpreted as the loss,
$\ell^t(x_{1:t}y_{1:t})$ is replaced by $l_t$ in (2) and (3), and $\mu$
is only non-zero if $l_t$ is consistent with the loss, i.e.
$l_t = \ell^t(x_{1:t}y_{1:t})$. In this way all possible unknowns are
absorbed in $\mu$.
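The construction can be written as a small wrapper that turns a distribution over regular inputs and a loss function into a distribution over augmented percepts $(x'_t, l_t)$. This is an illustrative sketch only: `absorb_loss`, the toy environment and the toy loss are hypothetical names and values, and integer losses are used so that the consistency check can be an exact comparison.

```python
def absorb_loss(mu_regular, loss_fn):
    """Build mu over percepts x_t = (x'_t, l_t) from mu' over x'_t.

    mu_regular(x_reg, xs_reg, ys) : mu'(x'_t | x'_{<t} y_{1:t}).
    loss_fn(xs_reg, ys)           : l^t(x'_{1:t} y_{1:t}).
    The returned mu is zero unless l_t equals the loss fixed by loss_fn.
    """
    def mu(x, xs, ys):
        x_reg, l = x
        # Strip the loss components to recover the regular history.
        regular_history = tuple(x_prev for x_prev, _ in xs)
        consistent = (l == loss_fn(regular_history + (x_reg,), ys))
        return mu_regular(x_reg, regular_history, ys) if consistent else 0.0
    return mu


def mu_regular(x, xs, ys):
    """Hypothetical environment: input 1 is likely after action 1."""
    p_one = 0.9 if ys[-1] == 1 else 0.2
    return p_one if x == 1 else 1.0 - p_one


def loss_fn(xs, ys):
    """Hypothetical loss: 1 if the latest regular input is 0, else 0."""
    return 1 - xs[-1]


mu = absorb_loss(mu_regular, loss_fn)

# Augmented percept space X' x {0, 1}.
augmented_percepts = [(x_reg, l) for x_reg in (0, 1) for l in (0, 1)]
total = sum(mu(p, (), (1,)) for p in augmented_percepts)
```

The augmented distribution still sums to one, since each regular input carries all of its probability on the single loss value consistent with it.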

The universal AI$\xi$ model: Encouraged by the good performance of the
universal sequence predictor $\Lambda_\xi$, we propose a new model,
where the probability distribution $\mu$ is learned indirectly by
replacing it with a universal prior $\xi$. We define
$\xi(x_{1:n}|y_{1:n}) := \sum_{\mu_i \in \mathcal{M}} w_{\mu_i} \cdot \mu_i(x_{1:n}|y_{1:n})$
as a weighted sum over chronological probability distributions in
$\mathcal{M}$. Convergence
$\xi(x_n|x_{<n}y_{1:n}) \to \mu(x_n|x_{<n}y_{1:n})$ can be proven
analogously to (1). Replacing $\mu$ by $\xi$ in (2) the AI$\xi$ system
outputs

$$y_t := \arg\min_{y_t}\sum_{x_t}\ldots\min_{y_n}\sum_{x_n} (l_1 + \ldots + l_n)\,\xi(x_{1:n}|y_{1:n}) \qquad (4)$$

in cycle $t$ given the history $x_{<t}y_{<t}$, where $x_t = x'_t l_t$.
The largest class $\mathcal{M}$ which is necessary from a computational
point of view is the set of all enumerable chronological semi-measures
with weights $w_{\mu_i} \sim 2^{-K(\mu_i)}$, where $K(\mu_i)$ is the
Kolmogorov complexity of $\mu_i$. Apart from the dependence on the
horizon $n$ and unimportant details, the AI$\xi$ system is uniquely
defined by (4) without adjustable parameters. It does not depend on
any assumption about the environment apart from being generated by
some computable (but unknown!) probability distribution in
$\mathcal{M}$.
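The universal class is not computable in practice, but the structure of (4) can be sketched with a tiny finite class standing in for $\mathcal{M}$. Everything specific below is an assumption made for illustration: two hypothetical "two-armed" environments that differ in which action is good, uniform weights, and percepts of the form $(x'_t, l_t)$ with a dummy regular input.

```python
def joint(mu, xs, ys):
    """mu(x_{1:t} | y_{1:t}) as a product of chronological conditionals."""
    p = 1.0
    for t in range(len(xs)):
        p *= mu(xs[t], xs[:t], ys[:t + 1])
    return p


def make_xi(envs, weights):
    """Conditional mixture xi(x_t | x_{<t} y_{1:t}) over a finite class."""
    def xi(x, xs, ys):
        num = sum(w * joint(m, xs + (x,), ys) for m, w in zip(envs, weights))
        den = sum(w * joint(m, xs, ys[:-1]) for m, w in zip(envs, weights))
        return num / den
    return xi


def ai_xi(xi, X, Y, n, xs=(), ys=()):
    """Expectimin over xi; the loss l_t is read from the percept itself."""
    if len(xs) == n:
        return 0.0, None
    best_value, best_action = float("inf"), None
    for y in Y:
        value = 0.0
        for x in X:
            p = xi(x, xs, ys + (y,))
            if p > 0.0:
                future, _ = ai_xi(xi, X, Y, n, xs + (x,), ys + (y,))
                value += p * (x[1] + future)   # x = (x'_t, l_t)
        if value < best_value:
            best_value, best_action = value, y
    return best_value, best_action


def make_env(good_action):
    """Hypothetical environment: loss 1 w.p. 0.1 on the good action, else 0.9."""
    def mu(x, xs, ys):
        p_loss = 0.1 if ys[-1] == good_action else 0.9
        return p_loss if x[1] == 1 else 1.0 - p_loss
    return mu


X = ((0, 0), (0, 1))          # dummy regular input 0, loss 0 or 1
Y = (0, 1)
xi = make_xi([make_env(0), make_env(1)], [0.5, 0.5])

# History: the agent tried action 0 and received loss 1.
history_x, history_y = ((0, 1),), (0,)
value, action = ai_xi(xi, X, Y, 2, history_x, history_y)
```

After one bad experience with action 0, the posterior weight shifts to the environment in which action 1 is good, and the minimizer switches to action 1 without the environment ever being identified explicitly.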

Universally optimal AI systems: We want to call an AI model universal
if it is $\mu$-independent (unbiased, model-free) and is able to solve
any solvable problem and learn any learnable task. Further, we call a
universal model universally optimal if there is no program which can
solve or learn significantly faster (in terms of interaction cycles).
Four observations lead us to risk the conjecture that AI$\xi$ is such
a universally optimal system: the AI$\xi$ model is parameterless;
$\xi$ rapidly converges to $\mu$ in the sense of (1); the AI$\mu$
model is itself optimal; and, by analogy to the sequence prediction
case, we expect no other model to converge faster to AI$\mu$ (in some
sense). Further support is given in […] by a detailed analysis of the
behaviour of AI$\xi$ for various problem classes, including
prediction, optimization, games, and supervised learning. We discuss
in which sense AI$\xi$ overcomes some fundamental problems in
reinforcement learning, like generalization, optimal learning rates,
exploration versus exploitation, etc. Computational issues are also
addressed.